This dataset contains a variety of information customers regarding credit risk default which is the risk that a lender takes the chance that a borrower fails to make required payments of the loan. In credit risk case, loan default is a problem of unbalanced data. To deal with it, we can use logistic regression.
There are 32,581 records with 12 columns. This project aims to know risk factors affect loan default and also make prediction probability occurrence or non-occurrence of loan default based on existing predictor values. The data set can be discovered in here
library(readr)
data <- read.csv("credit_risk_dataset.csv")library(DT)
datatable(data,options=list(pageLength=10,scrollX='400px'),filter='top')## Warning in instance$preRenderHook(instance): It seems your data is too big for
## client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html
person_age: age
person_income: annual
income
person_home_ownership: home
onwership
person_emp_length: employment
length in years
loan_intent: loan intent
loan_grade: loan_grade
loan_amnt: loan_amount
loan_int_rate: interest
rate
loan_status: loan status (0 is non
default and 1 is default)
loan_percent_income:
percent_income
cb_person_default_on_file:
historical default
cb_person_cred_hist_length: credit
history length
library(dplyr)
glimpse(data)## Rows: 32,581
## Columns: 12
## $ person_age <int> 22, 21, 25, 23, 24, 21, 26, 24, 24, 21, 22,…
## $ person_income <int> 59000, 9600, 9600, 65500, 54400, 9900, 7710…
## $ person_home_ownership <chr> "RENT", "OWN", "MORTGAGE", "RENT", "RENT", …
## $ person_emp_length <dbl> 123, 5, 1, 4, 8, 2, 8, 5, 8, 6, 6, 2, 2, 4,…
## $ loan_intent <chr> "PERSONAL", "EDUCATION", "MEDICAL", "MEDICA…
## $ loan_grade <chr> "D", "B", "C", "C", "C", "A", "B", "B", "A"…
## $ loan_amnt <int> 35000, 1000, 5500, 35000, 35000, 2500, 3500…
## $ loan_int_rate <dbl> 16.02, 11.14, 12.87, 15.23, 14.27, 7.14, 12…
## $ loan_status <int> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0…
## $ loan_percent_income <dbl> 0.59, 0.10, 0.57, 0.53, 0.55, 0.25, 0.45, 0…
## $ cb_person_default_on_file <chr> "Y", "N", "N", "N", "Y", "N", "N", "N", "N"…
## $ cb_person_cred_hist_length <int> 3, 2, 3, 2, 4, 2, 3, 4, 2, 3, 4, 2, 2, 4, 4…
We can see dataset contain several character variables that need to transformed into factor variables.
# Encode `loan_status` 0 == non default, 1 == default
data$loan_status <- ifelse(data$loan_status == 0,"non default","default")# Transform categorical data
library(dplyr)
data <- data %>%
mutate_at(vars(person_home_ownership,loan_intent,loan_grade,loan_status,cb_person_default_on_file),as.factor)
glimpse(data)## Rows: 32,581
## Columns: 12
## $ person_age <int> 22, 21, 25, 23, 24, 21, 26, 24, 24, 21, 22,…
## $ person_income <int> 59000, 9600, 9600, 65500, 54400, 9900, 7710…
## $ person_home_ownership <fct> RENT, OWN, MORTGAGE, RENT, RENT, OWN, RENT,…
## $ person_emp_length <dbl> 123, 5, 1, 4, 8, 2, 8, 5, 8, 6, 6, 2, 2, 4,…
## $ loan_intent <fct> PERSONAL, EDUCATION, MEDICAL, MEDICAL, MEDI…
## $ loan_grade <fct> D, B, C, C, C, A, B, B, A, D, B, A, A, E, A…
## $ loan_amnt <int> 35000, 1000, 5500, 35000, 35000, 2500, 3500…
## $ loan_int_rate <dbl> 16.02, 11.14, 12.87, 15.23, 14.27, 7.14, 12…
## $ loan_status <fct> default, non default, default, default, def…
## $ loan_percent_income <dbl> 0.59, 0.10, 0.57, 0.53, 0.55, 0.25, 0.45, 0…
## $ cb_person_default_on_file <fct> Y, N, N, N, Y, N, N, N, N, N, N, N, N, N, N…
## $ cb_person_cred_hist_length <int> 3, 2, 3, 2, 4, 2, 3, 4, 2, 3, 4, 2, 2, 4, 4…
colSums(is.na(data))## person_age person_income
## 0 0
## person_home_ownership person_emp_length
## 0 895
## loan_intent loan_grade
## 0 0
## loan_amnt loan_int_rate
## 0 3116
## loan_status loan_percent_income
## 0 0
## cb_person_default_on_file cb_person_cred_hist_length
## 0 0
There are missing values in person_emp_length and
loan_int_rate. Some scenarios for handling missing values,
one of which we can impute missing values with the most frequent value
for categorical variables. While continuous variables impute with the
median of the column values.
#`person_emp_length`, NaN values are replaced with the most frequent value (mode)
data$person_emp_length[is.na(data$person_emp_length)] <- as.numeric(names(which.max(table(data$person_emp_length))))#`loan_int_rate`, NaN values are replaced with the the median of the column values
data$loan_int_rate[is.na(data$loan_int_rate)] <- median(data$loan_int_rate,na.rm=TRUE)# Check missing values are replaced
colSums(is.na(data))## person_age person_income
## 0 0
## person_home_ownership person_emp_length
## 0 0
## loan_intent loan_grade
## 0 0
## loan_amnt loan_int_rate
## 0 0
## loan_status loan_percent_income
## 0 0
## cb_person_default_on_file cb_person_cred_hist_length
## 0 0
All variables with different data types have been converted to the desired data type and there are no missing values.
library(ggplot2)
library(ggpubr)
library(patchwork)fig_1 <- ggplot(data,aes(x=person_age)) +
geom_histogram(aes(y=after_stat(density)),color="skyblue",fill="skyblue") + geom_density(alpha=0.2) + ggtitle("Figure 1: The density of age person")
fig_1
Figure 1 shows the majority of people are 20 to 60 years old. People who have age more than 120 can be indicated outliers and it will be deleted.
fig_2 <- ggplot(data,aes(x=person_income)) +
geom_histogram(aes(y=after_stat(density)),color="skyblue",fill="skyblue") +
geom_density(alpha=0.2,fill="skyblue") + ggtitle("Figure 2: The density of annual income person")
fig_2
Figure 2 depicts there is indication possible outliers to people who have income more than 4 millions and it will be dropped.
fig_3 <- ggplot(data,aes(x=person_emp_length)) +
geom_histogram(aes(y=after_stat(density)),color="skyblue",fill="skyblue") +
geom_density(alpha=0.2,fill="skyblue") + ggtitle("Figure 3: The density of employment length")
fig_3Figure 3 represents there is also indication possible outliers. People have employment length more than 60 years will be removed.
# Outliers will be removed
person_age_out <- which(data$person_age > 100)
person_income_out <- which(data$person_income > 4000000)
emp_length_out <- which(data$person_emp_length > 60)
data_cr <- data[-c(person_age_out,person_income_out,emp_length_out),]library(ggcorrplot)
model.matrix(~0+., data=data_cr) %>%
cor(use="pairwise.complete.obs") %>%
ggcorrplot(show.diag = FALSE, type="lower",tl.cex=5, lab=TRUE,
lab_size=1.5) + ggtitle("Figure 4: Correlation between variables")
As we can see, some variables have correlaction to
loan_status such as
person_home_ownership,loan_intent,loan_grade,loan_amnt,
and loan_int_rate.
fig_loan <- ggplot(data_cr,aes(x=loan_status, fill=loan_status)) + geom_bar() +
geom_text(stat = "count", aes(label = after_stat(count)),
colour = "white", size = 3.5, position=position_stack(vjust=0.5)) +
ggtitle("Figure 5: Loan Status")
fig_loan
The count of people with default loan is less than those with non-default status.
# Relation between person_home_ownership with loan status
fig_ph <- ggplot(data_cr,aes(x=person_home_ownership, fill=loan_status)) + geom_bar() +
geom_text(stat = "count", aes(label = after_stat(count)),
colour = "white", size = 2.5, position=position_stack(vjust=0.5)) +
ggtitle("Figure 6 : Person home ownership by loan status")
# Relation between loan grade with loan status
fig_grade <- ggplot(data_cr,aes(x=loan_grade, fill=loan_status)) + geom_bar() +
geom_text(stat = "count", aes(label = after_stat(count)),
colour = "white", size = 2.5, position=position_stack(vjust=0.5)) +
ggtitle("Figure 7: Loan grade by loan status")
fig_mix_1 <- fig_ph / fig_grade
fig_mix_1
Figure 6 displays mostly people with default and non-default loan have mortgage and rent for home ownership. In figure 7, the number of people with non-default loan decrease as the loan grade increasing. It occurs the increasing and decreasing in the number of people with default loan as the loan grade enhancing.
# Relation between loan intent with loan status
fig_int <- ggplot(data_cr,aes(x=loan_intent, fill=loan_status)) + geom_bar() +
geom_text(stat = "count", aes(label = after_stat(count)),
colour = "white", size = 2.5, position=position_stack(vjust=0.5)) +
theme(axis.text.x = element_text(angle = 90)) +
ggtitle("Figure 8: Loan intent by loan status")
fig_int
In figure 8, people with non-default more tend to take loan for education necessary and people with default mostly loan used for medical.
# Relation between interest rate with loan status
fig_ir <- ggplot(data_cr,aes(x=loan_status, y=loan_int_rate, fill=loan_status)) + geom_boxplot() +
ggtitle("Figure 9: Interest rate by loan status") + coord_flip()
# Relation between loan amount with loan status
fig_am <- ggplot(data_cr,aes(x=loan_status, y=loan_amnt, fill=loan_status)) + geom_boxplot() +
ggtitle("Figure 10: Loan amount by loan status") + coord_flip()
fig_mix_2 <- fig_ir / fig_am
fig_mix_2The boxplot in figure 9 above appears to indicate that people with default have a higher interest rate than people with non-default. Figure 10 depicts that people with non-default have less than loan amount than people with default.
We can split dataset into the training set used to build model and testing set used to examine the model has been obtained. We can divide 80% for training set and 20% for testing set.
library(dplyr)
set.seed(123)
split <- sample(nrow(data_cr),size=nrow(data_cr)*0.8)
train <- data_cr[split,]
head(train)## person_age person_income person_home_ownership person_emp_length
## 18853 27 27840 MORTGAGE 2
## 18901 30 12000 RENT 0
## 26809 35 114000 MORTGAGE 11
## 25108 29 74000 MORTGAGE 3
## 28873 35 48000 RENT 6
## 2992 21 30000 RENT 3
## loan_intent loan_grade loan_amnt loan_int_rate loan_status
## 18853 MEDICAL A 2400 5.42 non default
## 18901 MEDICAL C 1800 14.35 non default
## 26809 EDUCATION B 7000 11.49 non default
## 25108 EDUCATION D 3000 16.32 default
## 28873 EDUCATION A 17000 7.88 default
## 2992 MEDICAL B 3200 11.14 non default
## loan_percent_income cb_person_default_on_file cb_person_cred_hist_length
## 18853 0.09 N 7
## 18901 0.15 Y 5
## 26809 0.06 N 10
## 25108 0.04 N 6
## 28873 0.35 N 8
## 2992 0.11 N 3
test <- data_cr[-split,]
renderDT(test,options=list(pageLength=10,scrollX='400px'),filter='top')model_loan <- glm(formula = loan_status ~ . ,family="binomial", data = train)
summary(model_loan)##
## Call:
## glm(formula = loan_status ~ ., family = "binomial", data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.102e+00 1.921e-01 21.354 < 2e-16 ***
## person_age 5.292e-03 6.482e-03 0.816 0.41427
## person_income -1.792e-06 5.692e-07 -3.149 0.00164 **
## person_home_ownershipOTHER -7.043e-01 3.004e-01 -2.344 0.01907 *
## person_home_ownershipOWN 1.696e+00 1.132e-01 14.985 < 2e-16 ***
## person_home_ownershipRENT -8.174e-01 4.473e-02 -18.273 < 2e-16 ***
## person_emp_length 1.571e-02 5.312e-03 2.958 0.00310 **
## loan_intentEDUCATION 8.485e-01 6.395e-02 13.269 < 2e-16 ***
## loan_intentHOMEIMPROVEMENT -5.854e-02 7.069e-02 -0.828 0.40767
## loan_intentMEDICAL 1.527e-01 6.033e-02 2.532 0.01135 *
## loan_intentPERSONAL 6.469e-01 6.553e-02 9.871 < 2e-16 ***
## loan_intentVENTURE 1.063e+00 6.880e-02 15.451 < 2e-16 ***
## loan_gradeB -2.312e-01 7.063e-02 -3.274 0.00106 **
## loan_gradeC -4.112e-01 1.007e-01 -4.082 4.46e-05 ***
## loan_gradeD -2.544e+00 1.230e-01 -20.685 < 2e-16 ***
## loan_gradeE -2.740e+00 1.596e-01 -17.168 < 2e-16 ***
## loan_gradeF -2.983e+00 2.341e-01 -12.742 < 2e-16 ***
## loan_gradeG -1.705e+01 1.026e+02 -0.166 0.86808
## loan_amnt 1.102e-04 5.052e-06 21.815 < 2e-16 ***
## loan_int_rate -6.113e-02 1.426e-02 -4.288 1.81e-05 ***
## loan_percent_income -1.345e+01 2.916e-01 -46.124 < 2e-16 ***
## cb_person_default_on_fileY -1.709e-03 5.616e-02 -0.030 0.97572
## cb_person_cred_hist_length -4.066e-03 9.858e-03 -0.412 0.68005
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 27209 on 26058 degrees of freedom
## Residual deviance: 17597 on 26036 degrees of freedom
## AIC: 17643
##
## Number of Fisher Scoring iterations: 13
This section explains testing to check coefficient significance using hypothesis null is there is no predictor variables affect response variable (loan status). Alternative hypothesis is at least one predictor variable influences loan status.
library(pscl)
pR2(model_loan,c("McFadden"))## fitting null model for pseudo-r2
## llh llhNull G2 McFadden r2ML
## -8.798344e+03 -1.360432e+04 9.611951e+03 3.532683e-01 3.084732e-01
## r2CU
## 4.760408e-01
cat('chi square value:',qchisq(0.95,11))## chi square value: 19.67514
Based on result performs model loan has \(G^{2}\) value more than chi-square table value with significance level 0.05. Therefore, decision to reject hypothesis null means at least one predictor variable affects significantly to status loan.
We do testing to know significance each predictor variables to response variable. Hypothesis null is the i-th predictor variables doesn’t have significant influencing to response variable (loan status). While, alternative hypothesis is at least one i-th predictor variable has significant influencing to loan status.
summary(model_loan)##
## Call:
## glm(formula = loan_status ~ ., family = "binomial", data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.102e+00 1.921e-01 21.354 < 2e-16 ***
## person_age 5.292e-03 6.482e-03 0.816 0.41427
## person_income -1.792e-06 5.692e-07 -3.149 0.00164 **
## person_home_ownershipOTHER -7.043e-01 3.004e-01 -2.344 0.01907 *
## person_home_ownershipOWN 1.696e+00 1.132e-01 14.985 < 2e-16 ***
## person_home_ownershipRENT -8.174e-01 4.473e-02 -18.273 < 2e-16 ***
## person_emp_length 1.571e-02 5.312e-03 2.958 0.00310 **
## loan_intentEDUCATION 8.485e-01 6.395e-02 13.269 < 2e-16 ***
## loan_intentHOMEIMPROVEMENT -5.854e-02 7.069e-02 -0.828 0.40767
## loan_intentMEDICAL 1.527e-01 6.033e-02 2.532 0.01135 *
## loan_intentPERSONAL 6.469e-01 6.553e-02 9.871 < 2e-16 ***
## loan_intentVENTURE 1.063e+00 6.880e-02 15.451 < 2e-16 ***
## loan_gradeB -2.312e-01 7.063e-02 -3.274 0.00106 **
## loan_gradeC -4.112e-01 1.007e-01 -4.082 4.46e-05 ***
## loan_gradeD -2.544e+00 1.230e-01 -20.685 < 2e-16 ***
## loan_gradeE -2.740e+00 1.596e-01 -17.168 < 2e-16 ***
## loan_gradeF -2.983e+00 2.341e-01 -12.742 < 2e-16 ***
## loan_gradeG -1.705e+01 1.026e+02 -0.166 0.86808
## loan_amnt 1.102e-04 5.052e-06 21.815 < 2e-16 ***
## loan_int_rate -6.113e-02 1.426e-02 -4.288 1.81e-05 ***
## loan_percent_income -1.345e+01 2.916e-01 -46.124 < 2e-16 ***
## cb_person_default_on_fileY -1.709e-03 5.616e-02 -0.030 0.97572
## cb_person_cred_hist_length -4.066e-03 9.858e-03 -0.412 0.68005
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 27209 on 26058 degrees of freedom
## Residual deviance: 17597 on 26036 degrees of freedom
## AIC: 17643
##
## Number of Fisher Scoring iterations: 13
Model loan represents that 8 out of the 11 predictors are
significantly associated to loan status. Variables exclude:
person_age, cb_person_default_on_file for Y
category, and cb_person_cred_hist_length.
#Rebuilding model using significant predictor variables to loan status
model_loan_final <- glm(formula = loan_status ~ person_income + person_home_ownership + person_emp_length + loan_intent + loan_grade +
loan_amnt + loan_int_rate + loan_percent_income,
family="binomial", data = train)
summary(model_loan_final)##
## Call:
## glm(formula = loan_status ~ person_income + person_home_ownership +
## person_emp_length + loan_intent + loan_grade + loan_amnt +
## loan_int_rate + loan_percent_income, family = "binomial",
## data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.220e+00 1.410e-01 29.923 < 2e-16 ***
## person_income -1.750e-06 5.663e-07 -3.090 0.00200 **
## person_home_ownershipOTHER -7.036e-01 3.001e-01 -2.345 0.01903 *
## person_home_ownershipOWN 1.695e+00 1.131e-01 14.986 < 2e-16 ***
## person_home_ownershipRENT -8.166e-01 4.472e-02 -18.258 < 2e-16 ***
## person_emp_length 1.632e-02 5.270e-03 3.096 0.00196 **
## loan_intentEDUCATION 8.449e-01 6.385e-02 13.233 < 2e-16 ***
## loan_intentHOMEIMPROVEMENT -5.499e-02 7.060e-02 -0.779 0.43598
## loan_intentMEDICAL 1.543e-01 6.032e-02 2.558 0.01053 *
## loan_intentPERSONAL 6.472e-01 6.552e-02 9.878 < 2e-16 ***
## loan_intentVENTURE 1.062e+00 6.879e-02 15.438 < 2e-16 ***
## loan_gradeB -2.313e-01 7.062e-02 -3.275 0.00106 **
## loan_gradeC -4.113e-01 9.672e-02 -4.252 2.12e-05 ***
## loan_gradeD -2.544e+00 1.195e-01 -21.284 < 2e-16 ***
## loan_gradeE -2.739e+00 1.571e-01 -17.438 < 2e-16 ***
## loan_gradeF -2.981e+00 2.327e-01 -12.808 < 2e-16 ***
## loan_gradeG -1.705e+01 1.026e+02 -0.166 0.86807
## loan_amnt 1.102e-04 5.049e-06 21.831 < 2e-16 ***
## loan_int_rate -6.124e-02 1.426e-02 -4.295 1.74e-05 ***
## loan_percent_income -1.345e+01 2.914e-01 -46.159 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 27209 on 26058 degrees of freedom
## Residual deviance: 17598 on 26039 degrees of freedom
## AIC: 17638
##
## Number of Fisher Scoring iterations: 13
# Simultaneous Parameter Significance Test
library(pscl)
pR2(model_loan_final,c("McFadden"))## fitting null model for pseudo-r2
## llh llhNull G2 McFadden r2ML
## -8.798868e+03 -1.360432e+04 9.610901e+03 3.532298e-01 3.084454e-01
## r2CU
## 4.759978e-01
cat('chi-square value:',qchisq(0.95,8))## chi-square value: 15.50731
# VIF values
library(car)
vif(model_loan_final)## GVIF Df GVIF^(1/(2*Df))
## person_income 1.665524 1 1.290552
## person_home_ownership 1.175679 3 1.027341
## person_emp_length 1.071110 1 1.034944
## loan_intent 1.059169 5 1.005765
## loan_grade 5.507077 6 1.152772
## loan_amnt 2.676186 1 1.635905
## loan_int_rate 5.062785 1 2.250063
## loan_percent_income 2.646835 1 1.626910
The predictor variables in model have a VIF value less 5, we can assume that multicollinearity is not an issue in the final model.
# Partial Parameter Significance Test
summary(model_loan_final)##
## Call:
## glm(formula = loan_status ~ person_income + person_home_ownership +
## person_emp_length + loan_intent + loan_grade + loan_amnt +
## loan_int_rate + loan_percent_income, family = "binomial",
## data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.220e+00 1.410e-01 29.923 < 2e-16 ***
## person_income -1.750e-06 5.663e-07 -3.090 0.00200 **
## person_home_ownershipOTHER -7.036e-01 3.001e-01 -2.345 0.01903 *
## person_home_ownershipOWN 1.695e+00 1.131e-01 14.986 < 2e-16 ***
## person_home_ownershipRENT -8.166e-01 4.472e-02 -18.258 < 2e-16 ***
## person_emp_length 1.632e-02 5.270e-03 3.096 0.00196 **
## loan_intentEDUCATION 8.449e-01 6.385e-02 13.233 < 2e-16 ***
## loan_intentHOMEIMPROVEMENT -5.499e-02 7.060e-02 -0.779 0.43598
## loan_intentMEDICAL 1.543e-01 6.032e-02 2.558 0.01053 *
## loan_intentPERSONAL 6.472e-01 6.552e-02 9.878 < 2e-16 ***
## loan_intentVENTURE 1.062e+00 6.879e-02 15.438 < 2e-16 ***
## loan_gradeB -2.313e-01 7.062e-02 -3.275 0.00106 **
## loan_gradeC -4.113e-01 9.672e-02 -4.252 2.12e-05 ***
## loan_gradeD -2.544e+00 1.195e-01 -21.284 < 2e-16 ***
## loan_gradeE -2.739e+00 1.571e-01 -17.438 < 2e-16 ***
## loan_gradeF -2.981e+00 2.327e-01 -12.808 < 2e-16 ***
## loan_gradeG -1.705e+01 1.026e+02 -0.166 0.86807
## loan_amnt 1.102e-04 5.049e-06 21.831 < 2e-16 ***
## loan_int_rate -6.124e-02 1.426e-02 -4.295 1.74e-05 ***
## loan_percent_income -1.345e+01 2.914e-01 -46.159 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 27209 on 26058 degrees of freedom
## Residual deviance: 17598 on 26039 degrees of freedom
## AIC: 17638
##
## Number of Fisher Scoring iterations: 13
Model final loan loads eight predictor variables can be said as best model for further analysis due to meet parameter significance test simultaneously and partially.
The goodness of fit aims to see whether the model is feasible or not. Using null hypothesis, there is no significant difference between the observed and the possible prediction results of model. Alternative hypothesis, there is significant difference between the observed and the possible prediction results of model.
library(generalhoslem)
loan_model<- model.frame(model_loan_final)
logitgof(loan_model$loan_status, fitted(model_loan_final))##
## Hosmer and Lemeshow test (binary model)
##
## data: loan_model$loan_status, fitted(model_loan_final)
## X-squared = 157.14, df = 8, p-value < 2.2e-16
Calculation odd ratio is used to obtain model interpretation.
coef_loan <- coef(model_loan_final)
odd_model_loan <- cbind(coef_loan, OR = exp(coef_loan), SK= exp(confint(model_loan_final)))## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
odd_model_loan## coef_loan OR 2.5 % 97.5 %
## (Intercept) 4.220163e+00 6.804460e+01 5.162817e+01 8.975523e+01
## person_income -1.750243e-06 9.999982e-01 9.999972e-01 9.999995e-01
## person_home_ownershipOTHER -7.036082e-01 4.947967e-01 2.769844e-01 8.988487e-01
## person_home_ownershipOWN 1.695464e+00 5.449174e+00 4.380375e+00 6.826495e+00
## person_home_ownershipRENT -8.165566e-01 4.419508e-01 4.047568e-01 4.823238e-01
## person_emp_length 1.631829e-02 1.016452e+00 1.006037e+00 1.027039e+00
## loan_intentEDUCATION 8.449275e-01 2.327809e+00 2.054448e+00 2.638765e+00
## loan_intentHOMEIMPROVEMENT -5.499305e-02 9.464917e-01 8.243400e-01 1.087177e+00
## loan_intentMEDICAL 1.542766e-01 1.166814e+00 1.036708e+00 1.313249e+00
## loan_intentPERSONAL 6.471735e-01 1.910134e+00 1.680317e+00 2.172406e+00
## loan_intentVENTURE 1.061878e+00 2.891797e+00 2.528093e+00 3.310600e+00
## loan_gradeB -2.312828e-01 7.935151e-01 6.907476e-01 9.110888e-01
## loan_gradeC -4.113108e-01 6.627809e-01 5.481741e-01 8.009342e-01
## loan_gradeD -2.544012e+00 7.855060e-02 6.209405e-02 9.921065e-02
## loan_gradeE -2.738768e+00 6.464993e-02 4.747209e-02 8.787218e-02
## loan_gradeF -2.980906e+00 5.074684e-02 3.204839e-02 7.984281e-02
## loan_gradeG -1.704798e+01 3.945980e-08 1.504092e-23 2.328060e-06
## loan_amnt 1.102384e-04 1.000110e+00 1.000100e+00 1.000120e+00
## loan_int_rate -6.123762e-02 9.405997e-01 9.146924e-01 9.672699e-01
## loan_percent_income -1.345113e+01 1.439622e-06 8.157696e-07 2.563999e-06
Performance model can be evaluated with make prediction using testing set.
prob_loan <- predict(model_loan_final, type= "response", newdata = test)
pred_loan <- ifelse(prob_loan > 0.5, "non default", "default")
test_loan <- data.frame(pred_loan,test$loan_status)
head(test_loan)## pred_loan test.loan_status
## 13 non default default
## 15 non default non default
## 26 default default
## 35 non default non default
## 36 non default default
## 41 default default
loan <- table(pred_values = pred_loan, actual_values = test$loan_status)
loan## actual_values
## pred_values default non default
## default 818 228
## non default 655 4814
Confusion Matrix output represents for predict loan status category default can predict correctly 818 out of 1473 while for category non default can predict correctly 4814 out of 5042.
#number of true positives/(number of true positives + number of false negatives)
sensitivity <- loan[2,2]/sum(loan[2,])*100
#number of true negatives/(number of true negatives + number of false positives)
specificity <- loan[1,1]/sum(loan[1,])*100
#number of true negatives + number of true positives/ (number of true positives + number of false positives + number of true negatives + number of false negatives)
accuracy <- (loan[1,1]+loan[2,2])/sum(loan)*100
fprate <- loan[2,1]/(loan[2,1]+loan[1,1])*100
AUC <- (100+sensitivity-fprate)/2
performance <- data.frame(sensitivity,specificity,accuracy,AUC)
performance## sensitivity specificity accuracy AUC
## 1 88.0234 78.20268 86.44666 71.77817
Sensitivity or True Positive Rate (TPR) is percentage of actual positives which are correctly identified. In model, TPR value shows that we detect correctly default loan status 88.02%.
While, Specificity or True Negative Rate (TNR) percentage of actual negatives will test negative. It can be shown FPR value that we detect correctly non default loan status 78.20%.
The classification model has a good prediction capability to default loan status and has a good enough prediction capability to non default loan status.
Accuracy 86.45% is good for correctly predicted using binary logistic regression model.